the context, (b) a set of conditions, which are the relations that
exist between the referents.
DRT provides a logically grounded platform for representing the
semantic structure of sentences, including complex predicates
such as implications, propositions, and negations. It can also
localize almost every kind of event separately and identify
its agents and patients.
Here is an example of a DRS representation of the sentence
“He drinks water.” Here, x1, x2, and x3 are the referents, and
male(x1), water(x2), drink(x3), event(x3), agent(x3, x1),
patient(x3, x2) are the conditions:
[ x1, x2, x3:
male(x1), water(x2), drink(x3),
event(x3), agent(x3, x1), patient(x3, x2) ]
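As a minimal sketch (not part of the original work), the example DRS above could be held in plain Python data structures, with referents as a list and each condition as a predicate name plus a tuple of referent arguments; all names here are illustrative.

```python
# Referents and conditions for the DRS of "He drinks water."
referents = ["x1", "x2", "x3"]
conditions = [
    ("male", ("x1",)),        # unary condition on referent x1
    ("water", ("x2",)),
    ("drink", ("x3",)),
    ("event", ("x3",)),       # x3 is the drinking event
    ("agent", ("x3", "x1")),  # binary relation: the event's agent is x1
    ("patient", ("x3", "x2")),
]

def pretty(referents, conditions):
    """Render the DRS in the bracketed form shown above."""
    conds = ", ".join(f"{p}({', '.join(args)})" for p, args in conditions)
    return f"[ {', '.join(referents)}: {conds} ]"

print(pretty(referents, conditions))
```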
B. Brief description of GML
Graph Modeling Language (GML) [7] is a simple and
efficient format for representing weighted directed graphs. A
GML file is primarily a 7-bit ASCII file, and its simple format
allows it to be read, parsed, and written without much hassle.
Moreover, several open-source software systems are available
for viewing and editing GML files.
Graphs are represented using several keys such as “graph”,
“node”, and “edge”. Each node has an “id” attribute, which is
later referenced by the “source” and “target” attributes of
edges. Edge weights are represented through the “label”
attribute associated with an edge key.
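To illustrate the keys just described, a small directed graph with two labelled nodes and one labelled edge might look as follows in GML (the node and edge contents are invented for illustration):

```gml
graph [
  directed 1
  node [ id 1 label "drink" ]
  node [ id 2 label "water" ]
  edge [ source 1 target 2 label "patient" ]
]
```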
C. k-NN classifier
The k-nearest-neighbour (k-NN) classifier is one of the
simplest and most popular machine learning algorithms.
Classifiers of this kind depend solely on the class labels of the
training examples that are similar to the test example, instead
of building an explicit class representation. Distance measures
such as the Euclidean and Manhattan distances are generally
used to compare the similarity of two examples. In the
standard k-NN algorithm, a new example is classified by the
majority vote of its neighbours. Usually, the number of
neighbours (the value of k) is determined empirically to obtain
the best results.
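The standard algorithm just described can be sketched in a few lines of Python; the distance function is a parameter, so a graph distance such as those of the next subsection could be plugged in instead of the Euclidean distance. The data and function names are illustrative.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_point, k=3, dist=euclidean):
    """Classify test_point by majority vote among its k nearest
    training examples; train is a list of (features, label) pairs."""
    neighbours = sorted(train, key=lambda ex: dist(ex[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy usage: two well-separated classes.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
print(knn_classify(train, (0.2, 0.1), k=3))  # -> A
```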
D. Distance metrics for graphs
As we have mentioned before, the goal of our current work
is to make a comparative analysis of different kinds of distance
metrics for text classification task.
We have taken five different distance metrics from [6] for use
in this work. They are popular in object recognition tasks, but
have rarely been applied to text categorization. For two graphs
$g_1$ and $g_2$, if $d(g_1, g_2)$ is the dissimilarity/similarity
measure, then $d$ would be a distance if it has the following
properties:
(i) $d(g_1, g_2) \geq 0$, and $d(g_1, g_2) = 0$ if and only if $g_1 = g_2$
(ii) $d(g_1, g_2) = d(g_2, g_1)$
(iii) $d(g_1, g_3) \leq d(g_1, g_2) + d(g_2, g_3)$
The measures that are involved in the current work follow
the above rules. The corresponding distance metrics for these
measures are:
$d_1(g_1, g_2) = 1 - \frac{|mcs(g_1, g_2)|}{\max(|g_1|, |g_2|)}$  (1)
$d_2(g_1, g_2) = 1 - \frac{|mcs(g_1, g_2)|}{|g_1| + |g_2| - |mcs(g_1, g_2)|}$  (2)
$d_3(g_1, g_2) = |g_1| + |g_2| - 2\,|mcs(g_1, g_2)|$  (3)
$d_4(g_1, g_2) = |MCS(g_1, g_2)| - |mcs(g_1, g_2)|$  (4)
$d_5(g_1, g_2) = 1 - \frac{|mcs(g_1, g_2)|}{|MCS(g_1, g_2)|}$  (5)
In the above equations, $mcs(g_1, g_2)$ and $MCS(g_1, g_2)$
denote the maximum common subgraph and minimum common
supergraph of the two graphs $g_1$ and $g_2$. Theoretically,
$mcs(g_1, g_2)$ is the largest graph, in terms of number of
edges, which is isomorphic to a subgraph of both $g_1$ and
$g_2$. The $mcs$ has been formally defined in the work of
Bunke et al. [8].
As stated earlier, finding the maximum common subgraph is
an NP-complete problem, and the algorithm for finding the
$mcs$ is actually a brute-force method: it first finds all the
subgraphs of both graphs and then selects the graph of maximum
size which is common to both $g_1$ and $g_2$. To increase the
computational speed of the program, it is modified into an
approximate version of the actual $mcs$, relying on the fact
that node pairs with greater similarity in their local
neighbourhoods in the two graphs have a larger probability of
inclusion in the $mcs$. The two-stage approach used in the
present work to form the approximate $mcs$ is as follows:
1. All the node pairs (one node from each graph) are sorted
in decreasing order of the similarity of their local
structures. In the present case, the number of self-loops
with equal labels in the two graphs is used as the
similarity measure.
2. The mcs is built by first adding each self-loop vertex pair
(starting with the one with the highest number of matching
labels) and considering it as an equivalent vertex, and then
including the remaining (non-self-loop) edges which
satisfy the chosen self-loops in both graphs.
In this way it can be ensured that the approximate version
possesses most of the properties of an mcs, while the complexity
is contained within a polynomial upper bound.
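The two-stage procedure above can be sketched roughly as follows, under the simplifying assumptions that each graph is given as a dict mapping a node to its set of self-loop labels, plus a set of labelled non-self-loop edges; the greedy one-to-one matching and all names are our own illustration, not the authors' implementation.

```python
from itertools import product

def approx_mcs(loops1, edges1, loops2, edges2):
    """Two-stage approximate maximum common subgraph.
    loopsN: dict node -> set of self-loop labels;
    edgesN: set of (source, target, label) non-self-loop edges.
    Returns (matched node pairs, common edges)."""
    # Stage 1: sort cross-graph node pairs by decreasing number of
    # matching self-loop labels (the local-structure similarity).
    pairs = sorted(product(loops1, loops2),
                   key=lambda p: len(loops1[p[0]] & loops2[p[1]]),
                   reverse=True)
    # Stage 2a: greedily treat each pair as an equivalent vertex,
    # using every node of either graph at most once.
    mapping, used1, used2 = {}, set(), set()
    for u, v in pairs:
        if u in used1 or v in used2:
            continue
        if not loops1[u] & loops2[v]:
            break  # remaining pairs share no self-loop labels
        mapping[u] = v
        used1.add(u)
        used2.add(v)
    # Stage 2b: keep the non-self-loop edges present in both graphs
    # under the chosen vertex equivalences.
    common_edges = {(u, w, lab) for (u, w, lab) in edges1
                    if u in mapping and w in mapping
                    and (mapping[u], mapping[w], lab) in edges2}
    return mapping, common_edges

# Toy usage: two tiny graphs that agree on a "chases" edge.
g1_loops = {"a": {"cat"}, "b": {"dog"}}
g1_edges = {("a", "b", "chases")}
g2_loops = {"x": {"cat"}, "y": {"dog"}}
g2_edges = {("x", "y", "chases")}
print(approx_mcs(g1_loops, g1_edges, g2_loops, g2_edges))
```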
The minimum common supergraph ($MCS(g_1, g_2)$) [4] is formed
using the union of the two graphs, i.e.
$|MCS(g_1, g_2)| = |g_1| + |g_2| - |mcs(g_1, g_2)|$.
The distance metrics of Equations 1, 2, and 5 were used without
modification, but those of Equations 3 and 4 were divided by
$|g_1| + |g_2|$ and $|MCS(g_1, g_2)|$, respectively,
to normalize them, keeping the value of the distance metrics
in the range $[0, 1]$.
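As a sketch, the mcs/MCS-based distances discussed in this subsection can be computed directly from the sizes of the two graphs and of their common sub- and supergraphs. The formulas below are the standard ones from the mcs/MCS literature and are an assumption on our part, as is the use of edge counts for graph size.

```python
def graph_distances(n1, n2, n_mcs, n_MCS):
    """Distances from the sizes (e.g. edge counts) of two graphs,
    their maximum common subgraph (n_mcs) and minimum common
    supergraph (n_MCS).  d3 and d4 are the normalized variants,
    divided by n1 + n2 and n_MCS respectively."""
    d1 = 1 - n_mcs / max(n1, n2)
    d2 = 1 - n_mcs / (n1 + n2 - n_mcs)      # graph-union denominator
    d3 = (n1 + n2 - 2 * n_mcs) / (n1 + n2)  # normalized variant
    d4 = (n_MCS - n_mcs) / n_MCS            # normalized variant
    d5 = 1 - n_mcs / n_MCS
    return d1, d2, d3, d4, d5

# Toy usage: |g1| = |g2| = 4, |mcs| = 2, |MCS| = 6.
print(graph_distances(4, 4, 2, 6))
```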
E. Tools
In order to extract DRSs from the summarized texts we used
“C&C” and “Boxer” [10], [11], which are very popular open-source
tools available for download at
http://svn.ask.it.usyd.edu.au/trac/candc. The tool chain consists
of a combinatory categorial grammar (CCG) [9] parser and outputs
the semantic
Polibits (49) 2014, ISSN 1870-9044
Nibaran Das, Swarnendu Ghosh, Teresa Gonçalves, Paulo Quaresma